CHAPTER 8 SORTING AND SEARCHING Two fundamental operations required of many applications are searching and sorting the data they operate on. Many different types of data are commonly sorted, such as customer names, payment due dates, or even a list of file names displayed in a file selection menu. If you are writing a programmer's cross reference utility, you may need to sort a list of variable names without regard to capitalization. In some cases, you may want to sort several pieces of related information based on the contents of only one of them. One example of that is a list of names and addresses sorted in ascending zip code order. Searching is equally important; for example, to locate a customer name in an array or disk file. In some cases you may wish to search for a complete match, while in others a partial match is needed. If you are searching a list of names for, say, Leonard, you probably would want to ignore Leonardo. But when searching a list of zip codes you may need to locate all that begin with the digits 068. There are many different ways sorting and searching can be accomplished, and the subject is by no means a simple one. Most programmers are familiar with the Bubble Sort, because it is the simplest to understand. Each adjacent pair of items is compared, and then exchanged if they are out of order. This process is repeated over and over, until the entire list has been examined as many times as there are items. Unfortunately, these repeated comparisons make the Bubble Sort an extremely poor performer. Similarly, code to perform a linear search that simply examines each item in succession for a match is easy to grasp, but it will be painfully slow when there are many items. In this chapter you will learn how sophisticated algorithms that handle these important programming chores operate. You will also learn how to sort data on more than one key. Often, it is not sufficient to merely sort a list of customers by their last name. For example, you may be expected to sort first by last name, then by first name, and finally by balance due. That is, all of the last names would first be sorted. Then within all of the Smiths you would sort again by first name, and for all of the John Smiths sort that subgroup based on how much money is owed. For completeness I will start each section by introducing sorting and searching methods that are easy to understand, and then progress to the more complex algorithms that are much more effective. Specifically, I will show the Quick Sort and Binary Search algorithms. When there are many thousands of data items, a good algorithm can make the difference between a sort routine that takes ten minutes to complete, and one that needs only a few seconds. Finally, I will discuss both BASIC and assembly language sort routines. As important as the right algorithm is for good performance, an assembly language implementation will be even faster. Chapter 12 describes how assembly language routines are written and how they work, and in this chapter I will merely show how to use the routines included with this book. SORTING FUNDAMENTALS ==================== Although there are many different ways to sort an array, the simplest sorting algorithm is the Bubble Sort. The name Bubble is used because a FOR/NEXT loop repeatedly examines each adjacent pair of elements in the array, and those that have higher values rise to the top like bubbles in a bathtub. The most common type of sort is ascending, which means that "A" comes before "B", which comes before "C", and so forth. Figure 8-1 shows how the name Zorba ascends to the top of a five-item list of first names. Initial array contents: Element 4 Kathy Element 3 Barbara Element 2 Cathy Element 1 Zorba < After 1 pass: Element 4 Kathy Element 3 Barbara Element 2 Zorba < Element 1 Cathy After 2 passes: Element 4 Kathy Element 3 Zorba < Element 2 Barbara Element 1 Cathy After 3 passes: Element 4 Zorba < Element 3 Kathy Element 2 Barbara Element 1 Cathy Figure 8.1: Data ascending a list during a bubble sort. The Bubble Sort routine that follows uses a FOR/NEXT loop to repeatedly examine an array and exchange elements as necessary, until all of the items are in the correct order. DEFINT A-Z DECLARE SUB BubbleSort (Array$()) CONST NumItems% = 20 CONST False% = 0 CONST True% = -1 DIM Array$(1 TO NumItems%) FOR X = 1 TO NumItems% READ Array$(X) NEXT CALL BubbleSort(Array$()) CLS FOR X = 1 TO NumItems% PRINT Array$(X) NEXT DATA Zorba, Cathy, Barbara, Kathy, Josephine DATA Joseph, Joe, Peter, Arnold, Glen DATA Ralph, Elli, Lucky, Rocky, Louis DATA Paula, Paul, Mary Lou, Marilyn, Keith END SUB BubbleSort (Array$()) STATIC DO OutOfOrder = False% 'assume it's sorted FOR X = 1 TO UBOUND(Array$) - 1 IF Array$(X) > Array$(X + 1) THEN SWAP Array$(X), Array$(X + 1) 'if we had to swap OutOfOrder = True% 'we may not be done END IF NEXT LOOP WHILE OutOfOrder END SUB This routine is simple enough to be self-explanatory, and only a few things warrant discussing. One is the OutOfOrder flag variable. When the array is nearly sorted to begin with, fewer passes through the loop are needed. The OutOfOrder variable determines when no more passes are necessary. It is cleared at the start of each loop, and set each time two elements are exchanged. If, after examining all of the elements in one pass no exchanges were required, then the sorting is done and there's no need for the DO loop to continue. The other item worth mentioning is that the FOR/NEXT loop is set to consider one element less than the array actually holds. This is necessary because each element is compared to the one above it. If the last element were included in the loop, then BASIC would issue a "Subscript out of range" error on the statement that examines Array$(X + 1). There are a number of features you can add to this Bubble Sort routine. For example, you could sort without regard to capitalization. In that case "adams" would come before "BAKER", even though the lowercase letter "a" has a higher ASCII value than the uppercase letter "B". To add that capability simply use BASIC's UCASE$ (or LCASE$) function as part of the comparisons: IF UCASE$(Array$(X)) > UCASE$(Array$(X + 1)) THEN And to sort based on the eight-character portion that starts six bytes into each string you would use this: IF MID$(Array$(X), 5, 8) > MID$(Array$(X + 1), 5, 8) THEN Although the comparisons in this example are based on just a portion of each string, the SWAP statement must exchange the entire elements. This opens up many possibilities as you will see later in this chapter. If there is a chance that the strings may contain trailing blanks that should be ignored, you can use RTRIM$ on each pair of elements: IF RTRIM$(Array$(X)) > RTRIM$(Array$(X + 1)) THEN Of course, you can easily combine these enhancements to consider only the characters in the middle after they have been converted to upper or lower case. Sorting in reverse (descending) order is equally easy; you'd simply replace the greater-than symbol (>) with a less-than symbol (<). Finally, you can modify the routine to work with any type of data by changing the array type identifier. That is, for every occurrence of Array$ you will change that to Array% or Array# or whatever is appropriate. If you are sorting a numeric array, then different modifications may be in order. For example, to sort ignoring whether the numbers are positive or negative you would use BASIC's ABS (absolute value) function: IF ABS(Array!(X)) > ABS(Array!(X + 1)) THEN It is important to point out that all of the simple modifications described here can also be applied to the more sophisticated sort routines we will look at later in this chapter. INDEXED SORTS Besides the traditional sorting methods--whether a Bubble Sort or Quick Sort or any other type of sort--there is another category of sort routine you should be familiar with. Where a conventional sort exchanges elements in an array until they are in order, an Index Sort instead exchanges elements in a parallel numeric array of *pointers*. The original data is left intact, so it may still be accessed in its natural order. However, the array can also be accessed in sorted order by using the element numbers contained in the index array. As with a conventional sort, the comparisons in an indexed sort routine examine each element in the primary array, but based on the element numbers in that index array. If it is determined that the data is out of order, the routine exchanges the elements in the index array instead of the primary array. A modification to the Bubble Sort routine to sort using an index is shown below. DEFINT A-Z DECLARE SUB BubbleISort (Array$(), Index()) CONST NumItems% = 20 CONST False% = 0 CONST True% = -1 DIM Array$(1 TO NumItems%) 'this holds the string data DIM Ndx(1 TO NumItems%) 'this holds the index FOR X = 1 TO NumItems% READ Array$(X) 'read the string data Ndx(X) = X 'initialize the index array NEXT CALL BubbleISort(Array$(), Ndx()) CLS FOR X = 1 TO NumItems% PRINT Array$(Ndx(X)) 'print based on the index NEXT DATA Zorba, Cathy, Barbara, Kathy, Josephine DATA Joseph, Joe, Peter, Arnold, Glen DATA Ralph, Elli, Lucky, Rocky, Louis DATA Paula, Paul, Mary lou, Marilyn, Keith SUB BubbleISort (Array$(), Index()) STATIC DO OutOfOrder = False% 'assume it's sorted FOR X = 1 TO UBOUND(Array$) - 1 IF Array$(Index(X)) > Array$(Index(X + 1)) THEN SWAP Index(X), Index(X + 1) 'if we had to swap OutOfOrder% = True% 'we're not done yet END IF NEXT LOOP WHILE OutOfOrder% END SUB In this indexed sort, all references to the data are through the index array. And when a swap is necessary, it is the index array elements that are exchanged. Note that an indexed sort requires that the index array be initialized to increasing values--even if the sort routine is modified to be descending instead of ascending. Therefore, when BubbleISort is called Ndx(1) must hold the value 1, Ndx(2) is set to 2, and so forth. In this example the index array is initialized by the caller. However, it would be just as easy to put that code into the subprogram itself. Since you can't pass an array that hasn't yet been dimensioned, it makes the most sense to do both steps outside of the subprogram. Either way, the index array must be assigned to these initial values. As I mentioned earlier, one feature of an indexed sort is that it lets you access the data in both its original and sorted order. But there are other advantages, and a disadvantage as well. The disadvantage is that each comparison takes slightly longer, because of the additional overhead required to first look up the element number in the index array, to determine which elements in the primary array will be compared. In some cases, though, that can be more than offset by requiring less time to exchange elements. If you are sorting an array of 230-byte TYPE variables, the time needed for SWAP to exchange the elements can become considerable. Every byte in both elements must be read and written, so the time needed increases linearly as the array elements become longer. Contrast that with the fixed two bytes in the integer index array that are swapped. Another advantage of an indexed sort is that it lends itself to sorting more data than can fit in memory. As you will see later in the section that shows how to sort files, it is far easier to manipulate an integer index than an entire file. Further, sorting the file data using multiple passes requires twice as much disk space as the file already occupies. DATA MANIPULATION TECHNIQUES Before I show the Quick Sort algorithm that will be used as a basis for the remaining sort examples in this chapter, you should also be aware of a few simple tricks that can help you maintain and sort your data. One was described in Chapter 6, using a pair of functions that pack and unpack dates such that the year is stored before the month, which in turn is before the day. Thus, date strings are reduced to only three characters each, and they can be sorted directly. Another useful speed-up trick is to store string data as integers or long integers. If you had a system of four-digit account numbers you could use an integer instead of a string. Besides saving half the memory and disk space, the integer comparisons in a sort routine will be many times faster than a comparison on string equivalents. Zip codes are also suited to this, and could be stored in a long integer. Even though the space savings is only one byte, the time needed to compare the values for sorting will be greatly reduced. This brings up another important point. As you learned in Chapter 2, all conventional (not fixed-length) strings require more memory than might be immediately apparent. Besides the amount of memory needed to hold the data itself, four additional bytes are used for a string descriptor, and two more beyond those for a back pointer. Therefore, a zip code stored as a string will actually require eleven bytes rather than the five you might expect. With this in mind, you may be tempted to think that using a fixed- length string to hold the zip codes will solve the problem. Since fixed- length strings do not use either descriptors or back pointers, they do not need the memory they occupy. And that leads to yet another issue. Whenever a fixed-length string or the string portion of a TYPE variable is compared, it must first be converted to a regular descriptored string. BASIC has only one string comparison routine, and it expects the addresses for two conventional string descriptors. Every time a fixed-length string is used as an argument for comparison, BASIC must create a temporary copy, call its comparison routine, and then delete the copy. This copying adds code and wastes an enormous amount of time; in many cases the copying will take longer than the comparison itself. Therefore, using integers and long integers for numeric data where possible will provide more improvement than just the savings in memory use. In some cases, however, you must use fixed-length string or TYPE arrays. In particular, when sorting information from a random access disk file it is most sensible to load the records into a TYPE array. And as you learned in Chapter 2, the string components of a TYPE variable or array element are handled by BASIC as a fixed-length string. So how can you effectively sort fixed-length string arrays without incurring the penalty BASIC's overhead imposes? With assembly language subroutines, of course! Rather than ask BASIC to pass the data using its normal methods, assembly language routines can be invoked passing the data segments and addresses directly. When you use SEG, or a combination of VARSEG and VARPTR with fixed-length and TYPE variables, BASIC knows that you want the segmented address of the variable or array element. Thus, you are tricking BASIC into not making a copy as it usually would when passing such data. An assembly language subroutine or function can be designed to accept data addresses in any number of ways. As you will see later when we discuss sorting on multiple keys, extra trickery is needed to do the same thing in a BASIC procedure. The three short assembly language functions that follow compare two portions of memory, and then return a result that can be tested by your program. ;COMPARE.ASM - compares two ranges of memory .Model Medium, Basic .Code Compare Proc Uses DS ES DI SI, SegAdr1:DWord, _ SegAdr2:DWord, NumBytes:Word Cld ;compare in the forward direction Mov SI,NumBytes ;get the address for NumBytes% Mov CX,[SI] ;put it into CX for comparing below Les DI,SegAdr1 ;load ES:DI with the first ; segmented address Lds SI,SegAdr2 ;load DS:SI with the second ; segmented address Repe Cmpsb ;do the compare Mov AX,0 ;assume the bytes didn't match Jne Exit ;we were right, skip over Dec AX ;wrong, decrement AX down to -1 Exit: Ret ;return to BASIC Compare Endp End ;COMPARE2.ASM - compares memory case-insensitive .Model Medium, Basic .Code Compare2 Proc Uses DS ES DI SI, SegAdr1:DWord, _ SegAdr2:DWord, NumBytes:Word Cld ;compare in the forward direction Mov BX,-1 ;assume the ranges are the same Mov SI,NumBytes ;get the address for NumBytes% Mov CX,[SI] ;put it into CX for comparing below Jcxz Exit ;if zero bytes were given, they're ; the same Les DI,SegAdr1 ;load ES:DI with the first address Lds SI,SegAdr2 ;load DS:SI with the second address Do: Lodsb ;load the current character from ; DS:SI into AL Call Upper ;capitalize as necessary Mov AH,AL ;copy the character to AH Mov AL,ES:[DI] ;load the other character into AL Inc DI ;point at the next one for later Call Upper ;capitalize as necessary Cmp AL,AH ;now, are they the same? Jne False ;no, exit now and show that Loop Do ;yes, continue Jmp Short Exit ;if we get this far, the bytes are ; all the same False: Inc BX ;increment BX to return zero Exit: Mov AX,BX ;assign the function output Ret ;return to BASIC Upper: Cmp AL,"a" ;is the character below an "a"? Jb Done ;yes, so we can skip it Cmp AL,"z" ;is the character above a "z"? Ja Done ;yes, so we can skip that too Sub AL,32 ;convert to upper case Done: Retn ;do a near return to the caller Compare2 Endp End ;COMPARE3.ASM - case-insensitive, greater/less than .Model Medium, Basic .Code Compare3 Proc Uses DS ES DI SI, SegAdr1:DWord, _ SegAdr2:DWord, NumBytes:Word Cld ;compare in the forward direction Xor BX,BX ;assume the ranges are the same Mov SI,NumBytes ;get the address for NumBytes% Mov CX,[SI] ;put it into CX for comparing below Jcxz Exit ;if zero bytes were given, they're ; the same Les DI,SegAdr1 ;load ES:DI with the first address Lds SI,SegAdr2 ;load DS:SI with the second address Do: Lodsb ;load the current character from ; DS:SI into AL Call Upper ;capitalize as necessary, remove for ; case-sensitive Mov AH,AL ;copy the character to AH Mov AL,ES:[DI] ;load the other character into AL Inc DI ;point at the next character for later Call Upper ;capitalize as necessary, remove for ; case-sensitive Cmp AL,AH ;now, are they the same? Loope Do ;yes, continue Je Exit ;we exhausted the data and they're ; the same Mov BL,1 ;assume block 1 was "greater" Ja Exit ;we assumed correctly Dec BX ;wrong, bump BX down to -1 Dec BX Exit: Mov AX,BX ;assign the function output Ret ;return to BASIC Upper: Cmp AL,"a" ;is the character below an "a"? Jb Done ;yes, so we can skip it Cmp AL,"z" ;is the character above a "z"? Ja Done ;yes, so we can skip that too Sub AL,32 ;convert to upper case Done: Retn ;do a near return to the caller Compare3 Endp End The first Compare routine above simply checks if all of the bytes are identical, and returns -1 (True) if they are, or 0 (False) if they are not. By returning -1 or 0 you can use either IF Compare%(Type1, Type2, NumBytes%) THEN or IF NOT Compare%(Type1, Type2, NumBytes%) THEN depending on which logic is clearer for your program. Compare2 is similar to Compare, except it ignores capitalization. That is, "SMITH" and Smith" are considered equal. The Compare3 function also compares memory and ignores capitalization, but it returns either -1, 0, or 1 to indicate if the first data range is less than, equal to, or greater than the second. The correct declaration and usage for each of these routines is shown below. Note that Compare and Compare2 are declared and used in the same fashion. Compare and Compare2: DECLARE FUNCTION Compare%(SEG Type1 AS ANY, SEG Type2 AS ANY, _ NumBytes%) Same = Compare%(Type1, Type2, NumBytes%) or DECLARE FUNCTION Compare%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _ BYVAL Adr2%, NumBytes%) Same = Compare%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%) Here, Same receives -1 if the two TYPE variables or ranges of memory are the same, or 0 if they are not. NumBytes% tells how many bytes to compare. Compare3: DECLARE FUNCTION Compare3%(SEG Type1 AS ANY, SEG Type2 AS ANY, _ NumBytes%) Result = Compare3%(Type1, Type2, NumBytes%) or DECLARE FUNCTION Compare3%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _ BYVAL Adr2%, NumBytes%) Result = Compare3%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%) Result receives 0 if the two type variables or ranges of memory are the same, -1 if the first is less when compared as strings, or 1 if the first is greater. NumBytes% tells how many bytes are to be to compared. In the context of a sort routine you could invoke Compare3 like this: IF Compare3%(TypeEl(X), TypeEl(X + 1), NumBytes%) = 1 THEN SWAP TypeEl(X), TypeEl(X + 1) END IF As you can see, these routines may be declared in either of two ways. When used with TYPE arrays the first is more appropriate and results in slightly less setup code being generated by the compiler. When comparing fixed-length strings or arbitrary blocks of memory (for example, when one of the ranges is on the display screen) you should use the second method. Since SEG does not work correctly with fixed-length strings, if you want to use that more efficient version you must create a dummy TYPE comprised solely of a single string portion: TYPE FixedLength Something AS STRING * 35 END TYPE Then simply use DIM to create a single variable or an array based on this or a similar TYPE, depending on what your program needs. The requirement to create a dummy TYPE was discussed in Chapter 2, and I won't belabor the reasons again here. These comparison routines will be used extensively in the sort routines presented later in this chapter; however, their value in other, non-sorting situations should also be apparent. Although these routines are written in assembly language, they are fairly simple to follow. It is important to understand that you do not need to know anything about assembly language to use them. All of the files you need to add these and all of the other routines in this book are contained on the accompanying diskette [here, in the same ZIP file as this text]. Chapter 12 discusses assembly language in great detail, and you can refer there for further explanation of the instructions used. If you plan to run the programs that follow in the QuickBASIC editor, you must load the BASIC.QLB Quick Library as follows: qb program /l basic Later when you compile these or other programs you must link with the parallel BASIC.LIB file: bc program [/o]; link program , , nul , basic; If you are using BASIC PDS start QBX using the BASIC7.QLB file, and then link with BASIC7.LIB to produce a stand-alone .EXE program. [VB/DOS users will also use the BASIC7 version.] THE QUICK SORT ALGORITHM ======================== It should be obvious to you by now that a routine written in assembly language will always be faster than an equivalent written in BASIC. However, simply translating a procedure to assembly language is not always the best solution. Far more important than which language you use is selecting an appropriate algorithm. The best sorting method I know is the Quick Sort, and a well-written version of Quick Sort using BASIC will be many times faster than an assembly language implementation of the Bubble Sort. The main problem with the Bubble Sort is that the number of comparisons required grows exponentially as the number of elements increases. Since each pass through the array exchanges only a few elements, many passes are required before the entire array is sorted. The Quick Sort was developed by C.A.R. (Tony) Hoare, and is widely recognized as the fastest algorithm available. In some special cases, such as when the data is already sorted or nearly sorted, the Quick Sort may be slightly slower than other methods. But in most situations, a Quick Sort is many times faster than any other sorting algorithm. As with the Bubble Sort, there are many different variations on how a Quick Sort may be coded. (You may have noticed that the Bubble Sort shown in Chapter 7 used a nested FOR/NEXT loop, while the one shown here uses a FOR/NEXT loop within a DO/WHILE loop.) A Quick Sort divides the array into sections--sometimes called partitions--and then sorts each section individually. Many implementations therefore use recursion to invoke the subprogram from within itself, as each new section is about to be sorted. However, recursive procedures in any language are notoriously slow, and also consume stack memory at an alarming rate. The Quick Sort version presented here avoids recursion, and instead uses a local array as a form of stack. This array stores the upper and lower bounds showing which section of the array is currently being considered. Another refinement I have added is to avoid making a copy of elements in the array. As a Quick Sort progresses, it examines one element selected arbitrarily from the middle of the array, and compares it to the elements that lie above and below it. To avoid assigning a temporary copy this version simply keeps track of the selected element number. When sorting numeric data, maintaining a copy of the element is reasonable. But when sorting strings--especially strings whose length is not known ahead of time--the time and memory required to keep a copy can become problematic. For clarity, the generic Quick Sort shown below uses the copy method. Although this version is meant for sorting a single precision array, it can easily be adapted to sort any type of data by simply changing all instances of the "!" type declaration character. '******** QSORT.BAS, Quick Sort algorithm demonstration 'Copyright (c) 1991 Ethan Winer DEFINT A-Z DECLARE SUB QSort (Array!(), StartEl, NumEls) RANDOMIZE TIMER 'generate a new series each run DIM Array!(1 TO 21) 'create an array FOR X = 1 TO 21 'fill with random numbers Array!(X) = RND(1) * 500 'between 0 and 500 NEXT FirstEl = 6 'sort starting here NumEls = 10 'sort this many elements CLS PRINT "Before Sorting:"; TAB(31); "After sorting:" PRINT "==============="; TAB(31); "==============" FOR X = 1 TO 21 'show them before sorting IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN PRINT "==>"; END IF PRINT TAB(5); USING "###.##"; Array!(X) NEXT CALL QSort(Array!(), FirstEl, NumEls) LOCATE 3 FOR X = 1 TO 21 'print them after sorting LOCATE , 30 IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN PRINT "==>"; 'point to sorted items END IF LOCATE , 35 PRINT USING "###.##"; Array!(X) NEXT SUB QSort (Array!(), StartEl, NumEls) STATIC REDIM QStack(NumEls \ 5 + 10) 'create a stack array First = StartEl 'initialize work variables Last = StartEl + NumEls - 1 DO DO Temp! = Array!((Last + First) \ 2) 'seek midpoint I = First J = Last DO 'reverse both < and > below to sort descending WHILE Array!(I) < Temp! I = I + 1 WEND WHILE Array!(J) > Temp! J = J - 1 WEND IF I > J THEN EXIT DO IF I < J THEN SWAP Array!(I), Array!(J) I = I + 1 J = J - 1 LOOP WHILE I <= J IF I < Last THEN QStack(StackPtr) = I 'Push I QStack(StackPtr + 1) = Last 'Push Last StackPtr = StackPtr + 2 END IF Last = J LOOP WHILE First < Last IF StackPtr = 0 THEN EXIT DO 'Done StackPtr = StackPtr - 2 First = QStack(StackPtr) 'Pop First Last = QStack(StackPtr + 1) 'Pop Last LOOP ERASE QStack 'delete the stack array END SUB Notice that I have designed this routine to allow sorting only a portion of the array. To sort the entire array you'd simply omit the StartEl and NumEls parameters, and assign First and Last from the LBOUND and UBOUND element numbers. That is, you will change these: First = StartEl and Last = StartEl + NumEls - 1 to these: First = LBOUND(Array!) and Last = UBOUND(Array!) As I mentioned earlier, the QStack array serves as a table of element numbers that reflect which range of elements is currently being considered. You will need to dimension this array to one element for every five elements in the primary array being sorted, plus a few extra for good measure. In this program I added ten elements, because one stack element for every five main array elements is not enough for very small arrays. For data arrays that have a large amount of duplicated items, you will probably need to increase the size of the stack array. Note that this ratio is not an absolute--the exact size of the stack that is needed depends on how badly out of order the data is to begin with. Although it is possible that one stack element for every five in the main array is insufficient in a given situation, I have never seen this formula fail. Because the stack is a dynamic integer array that is stored in far memory, it will not impinge on near string memory. If this routine were designed using the normal recursive method, BASIC's stack would be used which is in near memory. Each of the innermost DO loops searches the array for the first element in each section about the midpoint that belongs in the other section. If the elements are indeed out of order (when I is less than J) the elements are exchanged. This incrementing and comparing continues until I and J cross. At that point, assuming the variable I has not exceeded the upper limits of the current partition, the partition bounds are saved and Last is assigned to the top of the next inner partition level. When the entire partition has been processed, the previous bounds are retrieved, but as a new set of First and Last values. This process continues until no more partition boundaries are on the stack. At that point the entire array is sorted. On the accompanying disk you will find a program called SEEQSORT.BAS that contains an enhanced version of the QSort demo and subprogram. This program lets you watch the progress of the comparisons and exchanges as they are made, and actually see this complex algorithm operate. Simply load SEEQSORT.BAS into the BASIC editor and run it. A constant named Delay! is defined at the beginning of the program. Increasing its value makes the program run more slowly; decreasing it causes the program to run faster. AN ASSEMBLY LANGUAGE QUICK SORT As fast as the BASIC QuickSort routine is, we can make it even faster. The listing below shows an assembly language version that is between ten and twenty percent faster, depending on which compiler you are using and if the BASIC PDS /fs (far strings) option is in effect. ;SORT.ASM - sorts an entire BASIC string array .Model Medium, Basic .Data S DW 0 F DW 0 L DW 0 I DW 0 J DW 0 MidPoint DW 0 .Code Extrn B$SWSD:Proc ;this swaps two strings Extrn B$SCMP:Proc ;this compares two strings Sort Proc Uses SI DI ES, Array:Word, Dir:Word Cld ;all fills and compares are forward Push DS ;set ES = DS for string compares Pop ES Xor CX,CX ;clear CX Mov AX,7376h ;load AL and AH with the opcodes ; Jae and Jbe in preparation for ; code self-modification Mov BX,Dir ;get the sorting direction Cmp [BX],CX ;is it zero (ascending sort)? Je Ascending ;yes, skip ahead Xchg AL,AH ;no exchange the opcodes Ascending: Mov CS:[X1],AH ;install correct comparison opcodes Mov CS:[X2],AL ; based on the sort direction Mov BX,Array ;load the array descriptor address Mov AX,[BX+0Eh] ;save the number of elements Dec AX ;adjust the number to zero-based Jns L0 ;at least 1 element, continue Jmp L4 ;0 or less elements, get out now! L0: Mov BX,Array ;reload array descriptor address Mov BX,[BX] ;Array$(LBOUND) descriptor address Mov S,SP ;StackPtr = 0 (normalized to SP) Mov F,CX ;F = 0 Mov L,AX ;L = Size% ;----- calculate the value of MidPoint L1: Mov DI,L ;MidPoint = (L + F) \ 2 Add DI,F Shr DI,1 Mov MidPoint,DI Mov AX,F ;I = F Mov I,AX Mov AX,L ;J = L Mov J,AX ;----- calculate the offset into the descriptor table for Array$(MidPoint) L1_2: Shl DI,1 ;multiply MidPoint in DI times 4 Shl DI,1 ;now DI holds how far beyond Array$(Start) ; Array$(MidPoint)'s descriptor is Add DI,BX ;add the array base address to produce the final ; address for Array$(MidPoint) ;----- calculate descriptor offset for Array$(I) L2: Mov SI,I ;put I into SI Shl SI,1 ;as above Shl SI,1 ;now SI holds how far beyond Array$(Start) ; Array$(I)'s descriptor is Add SI,BX ;add the base to produce the final descriptor ; address ;IF Array$(I) < Array$(MidPoint) THEN I = I + 1: GOTO L2 Push BX ;save BX because B$SCMP trashes it Push SI Push DI Call B$SCMP ;do the compare Pop BX ;restore BX X1 Label Byte ;modify the code below to "Jbe" if descending sort Jae L2_1 ;Array$(I) isn't less, continue on Inc Word Ptr I ;I = I + 1 Jmp Short L2 ;GOTO L2 ;----- calculate descriptor offset for Array$(J) L2_1: Mov SI,J ;put J into SI Shl SI,1 ;as above Shl SI,1 ;now SI holds how far beyond Array$(Start) ; Array$(J)'s descriptor is Add SI,BX ;add the base to produce the final descriptor ; address ;IF Array$(J) > Array$(MidPoint) THEN J = J - 1: GOTO L2.1 Push BX ;preserve BX Push SI Push DI Call B$SCMP ;do the compare Pop BX ;restore BX X2 Label Byte ;modify the code below to "Jae" if descending sort Jbe L2_2 ;Array$(J) isn't greater, continue on Dec Word Ptr J ;J = J - 1 Jmp Short L2_1 ;GOTO L2.1 L2_2: Mov AX,I ;IF I > J GOTO L3 Cmp AX,J Jg L3 ;J is greater, go directly to L3 Je L2_3 ;they're the same, skip the swap ;Swap Array$(I), Array$(J) Mov SI,I ;put I into SI Mov DI,J ;put J into DI Cmp SI,MidPoint ;IF I = MidPoint THEN MidPoint = J Jne No_Mid1 ;not equal, skip ahead Mov MidPoint,DI ;equal, assign MidPoint = J Jmp Short No_Mid2 ;don't waste time comparing again No_Mid1: Cmp DI,MidPoint ;IF J = MidPoint THEN MidPoint = I Jne No_Mid2 ;not equal, skip ahead Mov MidPoint,SI ;equal, assign MidPoint = I No_Mid2: Mov SI,I ;put I into SI Shl SI,1 ;multiply times four for the Shl SI,1 ; for the descriptors Add SI,BX ;add address for first descriptor Mov DI,J ;do the same for J in DI Shl DI,1 Shl DI,1 Add DI,BX Push BX ;save BX because B$SWSD destroys it Call B$SWSD ;and swap 'em good Pop BX L2_3: Inc Word Ptr I ;I = I + 1 Dec Word Ptr J ;J = J - 1 Mov AX,I ;IF I <= J GOTO L2 Cmp AX,J Jg L3 ;it's greater, skip to L3 Mov DI,MidPoint ;get MidPoint again Jmp L1_2 ;go back to just before L2 L3: Mov AX,I ;IF I < L THEN PUSH I: PUSH L Cmp AX,L Jnl L3_1 ;it's not less, so skip Pushes Push I ;Push I Push L ;Push L L3_1: Mov AX,J ;L = J Mov L,AX Mov AX,F ;IF F < L GOTO L1 Cmp AX,L Jnl L3_2 ;it's not less, jump ahead to L3_2 Jmp L1 ;it's less, go to L1 L3_2: Cmp S,SP ;IF S = 0 GOTO L4 Je L4 Pop L ;Pop L Pop F ;Pop F Jmp L1 ;GOTO L1 L4: Ret ;return to BASIC Sort Endp End Besides being faster than the BASIC version, the assembly language Sort routine is half the size. This version also supports sorting either forward or backward, but not just a portion of an array. The general syntax is: CALL Sort(Array$(), Direction) Where Array$() is any variable-length string array, and Direction is 0 for ascending, or any other value for descending. Note that this routine calls upon BASIC's internal services to perform the actual comparing and swapping; therefore, the exact same code can be used with either QuickBASIC or BASIC PDS. Again, I refer you forward to Chapter 12 for an explanation of the assembly language commands used in SORT.ASM. SORTING ON MULTIPLE KEYS In many situations, sorting based on one key is sufficient. For example, if you are sorting a mailing list to take advantage of bulk rates you must sort all of the addresses in order by zip code. When considering complex data such as a TYPE variable, it is easy to sort the array based on one component of each element. The earlier Bubble Sort example showed how MID$ could be used to consider just a portion of each string, even though the entire elements were exchanged. Had that routine been designed to operate on a TYPE array, the comparisons would have examined just one component, but the SWAP statements would exchange entire elements: IF Array(X).ZipCode > Array(X + 1).ZipCode THEN SWAP Array(X), Array(X + 1) END IF This way, each customer's last name, first name, street address, and so forth remain connected to the zip codes that are being compared and exchanged. There are several ways to sort on more than one key, and all are of necessity more complex than simply sorting based on a single key. One example of a multi-key sort first puts all of the last names in order. Then within each group of identical last names the first names are sorted, and within each group of identical last and first names further sorting is performed on yet another key--perhaps Balance Due. As you can see, this requires you to sort based on differing types of data, and also to compare ranges of elements for the subgroups that need further sorting. The biggest complication with this method is designing a calling syntax that lets you specify all of the information. A table array must be established to hold the number of keys, the type of data in each key (string, double precision, and so forth), and how many bytes into the TYPE element each key portion begins. Worse, you can't simply use the name of a TYPE component in the comparisons inside the sort routine--which would you use: Array(X).LastName, Array(X).FirstName, or Array(X).ZipCode? Therefore, a truly general multi-key sort must be called passing the address where the array begins in memory, and a table of offsets beyond that address where each component being considered is located. To avoid this added complexity I will instead show a different method that has only a few minor restrictions, but is much easier to design and understand. This method requires you to position each TYPE component into the key order you will sort on. You will also need to store all numbers that will be used for a sort key as ASCII digits. To sort first on last name, then first name, and then on balance due, the TYPE might be structured as follows: TYPE Customer LastName AS STRING * 15 FirstName AS STRING * 15 BalanceDue AS STRING * 9 Street AS STRING * 32 City AS STRING * 15 State AS STRING * 2 ZipCode AS STRING * 5 AnyNumber AS DOUBLE END TYPE In most cases the order in which each TYPE member is placed has no consequence. When you refer to TypeVar.LastName, BASIC doesn't care if LastName is defined before or after FirstName in the TYPE structure. Either way it translates your reference to LastName into an address. Having to store numeric data as strings is a limitation, but this is needed only for those TYPE fields that will be used as a sort key. The key to sorting on multiple items simultaneously is by treating the contiguous fields as a single long field. Since assignments to the string portion of a TYPE variable are handled internally by BASIC's LSET routine, the data in each element will be aligned such that subsequent fields can be treated as an extension of the primary field. Figure 8-2 below shows five TYPE array elements in succession, as they would be viewed by a string comparison routine. This data is defined as a subset of the name and address TYPE shown above, using just the first three fields. Notice that the balance due fields must be right-aligned (using RSET) for the numeric values to be considered correctly. Type.LastName Type.FirstName Type.BalanceDue ===============---------------========= Munro Jay 8000.00 Smith John 122.03 Johnson Alfred 14537.89 Rasmussen Peter 100.90 Hudson Cindy 21.22 ^ ^ ^ Field 1 Field 2 Field 3 starts here starts here starts here Figure 8-2: Multiple contiguous fields in a TYPE can be treated as a single long field. Thus, the sort routine would be told to start at the first field, and consider the strings to be 15 + 15 + 9 = 39 characters long. This way all three fields are compared at one time, and treated as a single entity. Additional fields can of course follow these, and they may be included in the comparison or not at your option. The combination demonstration and subroutine below sorts such a TYPE array on any number of keys using this method, and it has a few additional features as well. Besides letting you confine the sorting to just a portion of the array, you may also specify how far into each element the first key is located. As long as the key fields are contiguous, they do not have to begin at the start of each TYPE. Therefore, you could sort just on the first name field, or on any other field or group of fields. 'TYPESORT.BAS - performs a multi-key sort on TYPE arrays 'Copyright (c) 1991 Ethan Winer DEFINT A-Z DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _ BYVAL Adr2, NumBytes) DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _ BYVAL Length) DECLARE SUB TypeSort (Segment, Address, ElSize, Offset, KeySize, NumEls) CONST NumEls% = 23 'this keeps it all on the screen TYPE MyType LastName AS STRING * 10 FirstName AS STRING * 10 Dollars AS STRING * 6 Cents AS STRING * 2 END TYPE REDIM Array(1 TO NumEls%) AS MyType '---- Disable (REM out) all but one of the following blocks to test Offset = 27 'start sorting with Cents ElSize = LEN(Array(1)) 'the length of each element KeySize = 2 'sort on the Cents only Offset = 21 'start sorting with Dollars ElSize = LEN(Array(1)) 'the length of each element KeySize = 8 'sort Dollars and Cents only Offset = 11 'start sorting with FirstName ElSize = LEN(Array(1)) 'the length of each element KeySize = 18 'sort FirstName through Cents Offset = 1 'start sorting with LastName ElSize = LEN(Array(1)) 'the length of each element KeySize = ElSize 'sort based on all 4 fields FOR X = 1 TO NumEls% 'build the array from DATA READ Array(X).LastName READ Array(X).FirstName READ Amount$ 'format the amount into money Dot = INSTR(Amount$, ".") IF Dot THEN RSET Array(X).Dollars = LEFT$(Amount$, Dot - 1) Array(X).Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2) ELSE RSET Array(X).Dollars = Amount$ Array(X).Cents = "00" END IF NEXT Segment = VARSEG(Array(1)) 'show where the array is Address = VARPTR(Array(1)) ' located in memory CALL TypeSort(Segment, Address, ElSize, Offset, KeySize, NumEls%) CLS 'display the results FOR X = 1 TO NumEls% PRINT Array(X).LastName, Array(X).FirstName, PRINT Array(X).Dollars; "."; Array(X).Cents NEXT DATA Smith, John, 123.45 DATA Cramer, Phil, 11.51 DATA Hogan, Edward, 296.08 DATA Cramer, Phil, 112.01 DATA Malin, Donald, 13.45 DATA Cramer, Phil, 111.3 DATA Smith, Ralph, 123.22 DATA Smith, John, 112.01 DATA Hogan, Edward, 8999.04 DATA Hogan, Edward, 8999.05 DATA Smith, Bob, 123.45 DATA Cramer, Phil, 11.50 DATA Hogan, Edward, 296.88 DATA Malin, Donald, 13.01 DATA Cramer, Phil, 111.1 DATA Smith, Ralph, 123.07 DATA Smith, John, 112.01 DATA Hogan, Edward, 8999.33 DATA Hogan, Edward, 8999.17 DATA Hogan, Edward, 8999.24 DATA Smith, John, 123.05 DATA Cramer, David, 1908.80 DATA Cramer, Phil, 112 END SUB TypeSort (Segment, Address, ElSize, Displace, KeySize, NumEls) STATIC REDIM QStack(NumEls \ 5 + 10) 'create a stack array First = 1 'initialize working variables Last = NumEls Offset = Displace - 1 'decrement once now rather than ' repeatedly later DO DO Temp = (Last + First) \ 2 'seek midpoint I = First J = Last DO WHILE Compare3%(Segment, Address + Offset + (I - 1) * ElSize, Segment, _ Address + Offset + (Temp-1) * ElSize, KeySize) = -1 '< 1 for descending I = I + 1 WEND WHILE Compare3%(Segment, Address + Offset + (J - 1) * ElSize, Segment, _ Address + Offset + (Temp-1) * ElSize, KeySize) = 1 '< -1 for descending J = J - 1 WEND IF I > J THEN EXIT DO IF I < J THEN CALL SwapMem(Segment, Address + (I - 1) * ElSize, Segment, _ Address + (J - 1) * ElSize, ElSize) IF Temp = I THEN Temp = J ELSEIF Temp = J THEN Temp = I END IF END IF I = I + 1 J = J - 1 LOOP WHILE I <= J IF I < Last THEN QStack(StackPtr) = I 'Push I QStack(StackPtr + 1) = Last 'Push Last StackPtr = StackPtr + 2 END IF Last = J LOOP WHILE First < Last IF StackPtr = 0 THEN EXIT DO 'Done StackPtr = StackPtr - 2 First = QStack(StackPtr) 'Pop First Last = QStack(StackPtr + 1) 'Pop Last LOOP ERASE QStack 'delete the stack array END SUB As you can see, this version of the Quick Sort subprogram is derived from the one shown earlier. The important difference is that all of the incoming information is passed as segments, addresses, and bytes, rather than using an explicit array name. But before describing the inner details of the subprogram itself, I'll address the demonstration portion and show how the routine is set up and called. As with some of the other procedures on the disk that comes with this book, you will extract the TypeSort subprogram and add it to your own programs by loading it as a module, and then using the Move option of BASIC's View Subs menu. You can quickly access this menu by pressing F2, and then use Alt-M to select Move. Once this is done you will unload TYPESORT.BAS using the Alt-F-U menu selection, and answer *No* when asked if you want to save the modified file. You could also copy the TypeSort subprogram into a separate file, and then load that file as a module in each program that needs it. Although the example TYPE definition here shows only four components, you may of course use any TYPE structure. TypeSort expects six parameters to tell it where in memory the array is located, how far into each element the comparison routines are to begin, the total length of each element, the length of the key fields, and the number of elements to sort. After defining MyType, the setup portion of TYPESORT.BAS establishes the offset, element size, and key size parameters. As you can see, four different sample setups are provided, and you should add remarking apostrophes to all but one of them. If the program is left as is, the last setup values will take precedence. The next section reads sample names, addresses and dollar amounts from DATA statements, and formats the dollar amounts as described earlier. The dollar portion of the amounts are right justified into the Dollars field of each element, and the Cents portion is padded with trailing zeros as necessary to provide a dollars and cents format. This way, the value 12.3 will be assigned as 12.30, and 123 will be formatted to 123.00 which gives the expected appearance. The final setup step is to determine where the array begins in memory. Since you specify the starting segment and address, it is simple to begin sorting at any array element. For example, to sort elements 100 through 200--even if the array is larger than that--you'd use VARSEG(Array(100)) and VARPTR(Array(100) instead of element 1 as shown in this example. In addition to the starting segment and address of the array, TypeSort also requires you to tell it how many elements to consider. If you are sorting the entire array and the array starts with element 1, this will simply be the UBOUND of the array. If you are sorting just a portion of the array then you give it only the number of elements to be sorted. So to sort elements 100 through 200, the number of elements will be 101. A general formula you can use for calculating this based on element numbers is NumElements = LastElement - FirstElement + 1. Now let's consider the TypeSort subprogram itself. Since it is more like the earlier QSort program than different, I will cover only the differences here. In fact, the primary difference is in the way comparisons and exchanges are handled. The Compare3 function introduced earlier is used to compare the array elements with the midpoint. Although QSort made a temporary copy of the midpoint element, that would be difficult to do here. Since the routine is designed to work with any type of data--and the size of each element can vary depending on the TYPE structure--it is impractical to make a copy. While SPACE$ could be used to claim a block of memory into which the midpoint element is copied, there's a much better way: the Temp variable is used to remember the element number itself. The only complication is that once elements I and J are swapped, Temp must be reassigned if it was equal to either of them. (This happens just below the call to SwapMem.) But the simple integer IF test and assignment required adds far less code and is much faster than making a copy of the element. TypeSort is designed to sort the array in ascending order, and comments in the code show how to change it to sort descending instead. If you prefer to have one subprogram that can do both, you should add an extra parameter, perhaps called Direction. Near the beginning of the routine before the initial outer DO you would add this: IF Direction = 0 THEN 'sort ascending ICompare = -1 JCompare = 1 ELSE 'sort descending ICompare = 1 JCompare = -1 END IF Then, where the results from Compare3 are compared to -1 and 1 replace those comparisons (at the end of each WHILE line) to instead use ICompare and JCompare: WHILE Compare3%(...) = ICompare I = I + 1 WEND WHILE Compare3%(...) = JCompare J = J - 1 WEND This way, you are using variables to establish the sorting direction, and those variables can be set either way each time TypeSort is called. The last major difference is that elements are exchanged using the SwapMem routine rather than BASIC's SWAP statement. While it is possible to call SWAP by aliasing its name as shown in Chapter 5, it was frankly simpler to write a new routine for this purpose. Further, BASIC's SWAP is slower than SwapMem because it must be able to handle variables of different lengths, and also exchange fixed-length and conventional strings. SwapMem is extremely simple, and it works very quickly. As I stated earlier, the only way to write a truly generic sort routine is by passing segments and addresses and bytes, instead of array names. Although it would be great if BASIC could let you declare a subprogram or function using the AS ANY option to allow any type of data, that simply wouldn't work. As BASIC compiles your program, it needs to know the size and type of each parameter. When you reference TypeVar.LastName, BASIC knows where within TypeVar the LastName portion begins, and uses that in its address calculations. It is not possible to avoid this limitation other than by using addresses as is done here. Indeed, this is the stuff that C and assembly language programs are made of. In these languages--especially assembly language--integer pointer variables are used extensively to show where data is located and how long it is. However, the formulas used within the Compare3 and SwapMem function calls are not at all difficult to understand. The formula Address + Offset - (I - 1) * ElSize indicates where the key field of element I begins. Address holds the address of the beginning of the first element, and Offset is added to identify the start of the first sort key. (I - 1) is used instead of I because addresses are always zero- based. That is, the first element in the array from TypeSort's perspective is element 0, even though the calling program considers it to be element 1. Finally, the element number is multiplied times the length of each element, to determine the value that must be added to the starting address and offset to obtain the final address for the data in element I. Please understand that calculations such as these are what the compiler must do each time you access an array element. Note that if you call TypeSort incorrectly or give it illegal element numbers, you will not receive a "Subscript out of range" error from BASIC. Rather, you will surely crash your PC and have to reboot. This is the danger--and fun--of manipulating pointers directly. As I stated earlier, the SwapMem routine that does the actual exchanging of elements is very simple, and it merely takes a byte from one element and exchanges it with the corresponding byte in the other. This task is greatly simplified by the use of the XCHG assembly language command, which is similar to BASIC's SWAP statement. Although XCHG cannot swap a word in memory with another word in memory, it can exchange memory with a register. SwapMem is shown in the listing below. ;SWAPMEM.ASM, swaps two sections of memory .Model Medium, Basic .Code SwapMem Proc Uses SI DI DS ES, Var1:DWord, Var2:DWord, NumBytes:Word Lds SI,Var1 ;get the segmented address of the ; first variable Les DI,Var2 ;and for the second variable Mov CX,NumBytes ;get the number of bytes to exchange Jcxz Exit ;we can't swap zero bytes! DoSwap: Mov AL,ES:[DI] ;get a byte from the second variable Xchg AL,[SI] ;swap it with the first variable Stosb ;complete the swap and increment DI Inc SI ;point to the next source byte Loop DoSwap ;continue until done Exit: Ret ;return to BASIC SwapMem Endp End INDEXED SORTING ON MULTIPLE KEYS Earlier I showed how to modify the simple Bubble Sort routine to sort a parallel index array instead of the primary array. One important reason you might want to do that is to allow access to the primary array in both its original and sorted order. Another reason, and one we will get to shortly, is to facilitate sorting disk files. Although a routine to sort the records in a file could swap the actual data, it takes a long time to read and write that much data on disk. Further, each time you wanted to access the data sorted on a different key, the entire file would need to be sorted again. A much better solution is to create one or more sorted lists of record numbers, and store those on disk each in a separate file. This lets you access the data sorted by name, or by zip code, or by any other field, without ever changing the actual file. The TypeISort subprogram below is adapted from TypeSort, and it sorts an index array that holds the element numbers of a TYPE array. 'TYPISORT.BAS, indexed multi-key sort for TYPE arrays DEFINT A-Z DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _ BYVAL Adr2, NumBytes) DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _ BYVAL Adr2, BYVAL Length) DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _ NumEls, Index()) CONST NumEls% = 23 'this keeps it all on the screen TYPE MyType LastName AS STRING * 10 FirstName AS STRING * 10 Dollars AS STRING * 6 Cents AS STRING * 2 END TYPE REDIM Array(1 TO NumEls%) AS MyType REDIM Index(1 TO NumEls%) 'create the index array Offset = 1 'start sorting with LastName ElSize = LEN(Array(1)) 'the length of each element KeySize = ElSize 'sort based on all 4 fields FOR X = 1 TO NumEls% 'build the array from DATA READ Array(X).LastName READ Array(X).FirstName READ Amount$ ... 'this continues as already ... ' shown in TypeSort NEXT FOR X = 1 TO NumEls% 'initialize the index Index(X) = X - 1 'but starting with 0 NEXT Segment = VARSEG(Array(1)) 'show where the array is Address = VARPTR(Array(1)) ' located in memory CALL TypeISort(Segment, Address, ElSize, Offset, KeySize, NumEls%, Index()) CLS 'display the results FOR X = 1 TO NumEls% '+ 1 adjusts to one-based PRINT Array(Index(X) + 1).LastName, PRINT Array(Index(X) + 1).FirstName, PRINT Array(Index(X) + 1).Dollars; "."; PRINT Array(Index(X) + 1).Cents NEXT DATA Smith, John, 123.45 'this continues as already ... ' shown in TypeSort ... END SUB TypeISort (Segment, Address, ElSize, Displace, KeySize, NumEls, _ Index()) STATIC REDIM QStack(NumEls \ 5 + 10) 'create a stack First = 1 'initialize working variables Last = NumEls Offset = Displace - 1 'make zero-based now for speed later DO DO Temp = (Last + First) \ 2 'seek midpoint I = First J = Last DO 'change -1 to 1 and 1 to -1 to sort descending WHILE Compare3%(Segment, Address + Offset + (Index(I) * ElSize), _ Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = -1 I = I + 1 WEND WHILE Compare3%(Segment, Address + Offset + (Index(J) * ElSize), _ Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = 1 J = J - 1 WEND IF I > J THEN EXIT DO IF I < J THEN SWAP Index(I), Index(J) IF Temp = I THEN Temp = J ELSEIF Temp = J THEN Temp = I END IF END IF I = I + 1 J = J - 1 LOOP WHILE I <= J IF I < Last THEN QStack(StackPtr) = I 'Push I QStack(StackPtr + 1) = Last 'Push Last StackPtr = StackPtr + 2 END IF Last = J LOOP WHILE First < Last IF StackPtr = 0 THEN EXIT DO 'Done StackPtr = StackPtr - 2 First = QStack(StackPtr) 'Pop First Last = QStack(StackPtr + 1) 'Pop Last LOOP ERASE QStack 'delete the stack array END SUB As with TypeSort, TypeISort is entirely pointer based so it can be used with any type of data and it can sort multiple contiguous keys. The only real difference is the addition of the Index() array parameter, and the extra level of indirection needed to access the index array each time a comparison is made. Also, when a swap is required, only the integer index elements are exchanged, which simplifies the code and reduces its size. Like TypeSort, you can change the sort direction by reversing the -1 and 1 values used with Compare3, or add a Direction parameter to the list and modify the code to use that. As with BubbleISort, the index array is initialized to increasing values by the calling program; however, here the first element is set to hold a value of 0 instead of 1. This reduces the calculations needed within the routine each time an address must be obtained. Therefore, when TypeISort returns, the caller must add 1 to the element number held in each index element. This is shown within the FOR/NEXT loop that displays the sorted results. SORTING FILES With the development of TypeISort complete, we can now use that routine to sort disk files. The sorting strategy will be to determine how many records are in the file, to determine how many separate passes are needed to process the entire file. TypeISort and TypeSort are restricted to working with arrays no larger than 64K (32K in the editing environment), so there is a limit as to how much data may be loaded into memory at one time. These sort routines can accommodate more data when compiled because address calculations that result in values larger than 32767 cause an overflow error in the QB editor. This overflow is in fact harmless, and is ignored in a compiled program unless you use the /d switch. Although the routines could be modified to perform segment and address arithmetic to accommodate larger arrays, that still wouldn't solve the problem of having more records than can fit in memory at once. Therefore, separate passes must be used to sort the file contents in sections, with each pass writing a temporary index file to disk. A final merge pass then reads each index to determine which pieces fits where, and then writes the final index file. The program FILESORT.BAS below incorporates all of the sorting techniques shown so far, and includes a few custom BASIC routines to improve its performance. 'FILESORT.BAS, indexed multi-key random access file sort DEFINT A-Z DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _ BYVAL Adr2, NumBytes) DECLARE FUNCTION Exist% (FileSpec$) DECLARE SUB DOSInt (Registers AS ANY) DECLARE SUB FileSort (FileName$, NDXName$, RecLength, Offset, KeySize) DECLARE SUB LoadFile (FileNum, Segment, Address, Bytes&) DECLARE SUB SaveFile (FileNum, Segment, Address, Bytes&) DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _ BYVAL Length) DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _ NumEls, Index()) RANDOMIZE TIMER 'create new data each run DEF FnRand% = INT(RND * 10 + 1) 'returns RND from 1 to 10 TYPE RegType 'used by DOSInt AX AS INTEGER BX AS INTEGER CX AS INTEGER DX AS INTEGER BP AS INTEGER SI AS INTEGER DI AS INTEGER FL AS INTEGER DS AS INTEGER ES AS INTEGER END TYPE DIM SHARED Registers AS RegType 'share among all subs REDIM LastNames$(1 TO 10) 'we'll select names at REDIM FirstNames$(1 TO 10) ' random from these NumRecords = 2988 'how many test records to use FileName$ = "TEST.DAT" 'really original, eh? NDXName$ = "TEST.NDX" 'this is the index file name TYPE RecType LastName AS STRING * 11 FirstName AS STRING * 10 Dollars AS STRING * 6 Cents AS STRING * 2 AnyNumber AS LONG 'this shows that only key OtherNum AS LONG ' information must be ASCII END TYPE FOR X = 1 TO 10 'read the possible last names READ LastNames$(X) NEXT FOR X = 1 TO 10 'and the possible first names READ FirstNames$(X) NEXT DIM RecordVar AS RecType 'to create the sample file RecLength = LEN(RecordVar) 'the length of each record CLS PRINT "Creating a test file..." IF Exist%(FileName$) THEN 'if there's an existing file KILL FileName$ 'kill the old data from prior END IF ' runs to start fresh IF Exist%(NDXName$) THEN 'same for any old index file KILL NDXName$ END IF '---- Create some test data and write it to the file OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength FOR X = 1 TO NumRecords RecordVar.LastName = LastNames$(FnRand%) RecordVar.FirstName = FirstNames$(FnRand%) Amount$ = STR$(RND * 10000) Dot = INSTR(Amount$, ".") IF Dot THEN RSET RecordVar.Dollars = LEFT$(Amount$, Dot - 1) RecordVar.Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2) ELSE RSET RecordVar.Dollars = Amount$ RecordVar.Cents = "00" END IF RecordVar.AnyNumber = X PUT #1, , RecordVar NEXT CLOSE '----- Created a sorted index based on the main data file Offset = 1 'start sorting with LastName KeySize = 29 'sort based on first 4 fields PRINT "Sorting..." CALL FileSort(FileName$, NDXName$, RecLength, Offset, KeySize) '----- Display the results CLS VIEW PRINT 1 TO 24 LOCATE 25, 1 COLOR 15 PRINT "Press any key to pause/resume"; COLOR 7 LOCATE 1, 1 OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength OPEN NDXName$ FOR BINARY AS #2 FOR X = 1 TO NumRecords GET #2, , ThisRecord 'get next rec. number GET #1, ThisRecord, RecordVar 'then the actual data PRINT RecordVar.LastName; 'print each field PRINT RecordVar.FirstName; PRINT RecordVar.Dollars; "."; PRINT RecordVar.Cents IF LEN(INKEY$) THEN 'pause on a keypress WHILE LEN(INKEY$) = 0: WEND END IF NEXT CLOSE VIEW PRINT 1 TO 24 'restore the screen END DATA Smith, Cramer, Malin, Munro, Passarelli DATA Bly, Osborn, Pagliaro, Garcia, Winer DATA John, Phil, Paul, Anne, Jacki DATA Patricia, Ethan, Donald, Tami, Elli END FUNCTION Exist% (Spec$) STATIC 'reports if a file exists DIM DTA AS STRING * 44 'the work area for DOS DIM LocalSpec AS STRING * 60 'guarantee the spec is in LocalSpec$ = Spec$ + CHR$(0) ' DGROUP for BASIC PDS Exist% = -1 'assume true for now Registers.AX = &H1A00 'assign DTA service Registers.DX = VARPTR(DTA) 'show DOS where to place it Registers.DS = VARSEG(DTA) CALL DOSInt(Registers) Registers.AX = &H4E00 'find first matching file Registers.CX = 39 'any file attribute okay Registers.DX = VARPTR(LocalSpec) Registers.DS = VARSEG(LocalSpec) CALL DOSInt(Registers) 'see if there's a match IF Registers.FL AND 1 THEN 'if the Carry flag is set Exist% = 0 ' there were no matches END IF END FUNCTION SUB FileSort (FileName$, NDXName$, RecLength, Displace, KeySize) STATIC CONST BufSize% = 32767 'holds the data being sorted Offset = Displace - 1 'make zero-based for speed later '----- Open the main data file FileNum = FREEFILE OPEN FileName$ FOR BINARY AS #FileNum '----- Calculate the important values we'll need NumRecords = LOF(FileNum) \ RecLength RecsPerPass = BufSize% \ RecLength IF RecsPerPass > NumRecords THEN RecsPerPass = NumRecords NumPasses = (NumRecords \ RecsPerPass) - ((NumRecords MOD RecsPerPass) _ <> 0) IF NumPasses = 1 THEN RecsLastPass = RecsPerPass ELSE RecsLastPass = NumRecords MOD RecsPerPass END IF '----- Create the buffer and index sorting arrays REDIM Buffer(1 TO 1) AS STRING * BufSize REDIM Index(1 TO RecsPerPass) IndexAdjust = 1 '----- Process all of the records in manageable groups FOR X = 1 TO NumPasses IF X < NumPasses THEN 'if not the last pass RecsThisPass = RecsPerPass 'do the full complement ELSE 'the last pass may have RecsThisPass = RecsLastPass ' fewer records to do END IF FOR Y = 1 TO RecsThisPass 'initialize the index Index(Y) = Y - 1 'starting with value of 0 NEXT '----- Load a portion of the main data file Segment = VARSEG(Buffer(1)) 'show where the buffer is CALL LoadFile(FileNum, Segment, Zero, RecsThisPass * CLNG(RecLength)) CALL TypeISort(Segment, Zero, RecLength, Displace, KeySize, _ RecsThisPass, Index()) '----- Adjust the zero-based index to record numbers FOR Y = 1 TO RecsThisPass Index(Y) = Index(Y) + IndexAdjust NEXT '----- Save the index file for this pass TempNum = FREEFILE OPEN "$$PASS." + LTRIM$(STR$(X)) FOR OUTPUT AS #TempNum CALL SaveFile(TempNum, VARSEG(Index(1)), Zero, RecsThisPass * 2&) CLOSE #TempNum '----- The next group of record numbers start this much higher IndexAdjust = IndexAdjust + RecsThisPass NEXT ERASE Buffer, Index 'free up the memory '----- Do a final merge pass if necessary IF NumPasses > 1 THEN NDXNumber = FREEFILE OPEN NDXName$ FOR BINARY AS #NDXNumber REDIM FileNums(NumPasses) 'this holds the file numbers REDIM RecordNums(NumPasses) 'this holds record numbers REDIM MainRec$(1 TO NumPasses) 'holds main record data REDIM Remaining(1 TO NumPasses) 'tracks index files '----- Open the files and seed the first round of data FOR X = 1 TO NumPasses FileNums(X) = FREEFILE OPEN "$$PASS." + LTRIM$(STR$(X)) FOR BINARY AS #FileNums(X) Remaining(X) = LOF(FileNums(X)) 'this is what remains MainRec$(X) = SPACE$(RecLength) 'holds main data file GET #FileNums(X), , RecordNums(X) 'get the next record number RecOffset& = (RecordNums(X) - 1) * CLNG(RecLength) + 1 GET #FileNum, RecOffset&, MainRec$(X) 'then get the data NEXT FOR X = 1 TO NumRecords Lowest = 1 'assume this is the lowest data in the group WHILE Remaining(Lowest) = 0 'Lowest can't refer to a dead index Lowest = Lowest + 1 'so seek to the next higher active index WEND FOR Y = 2 TO NumPasses 'now seek out the truly lowest element IF Remaining(Y) THEN 'consider only active indexes IF Compare3%(SSEG(MainRec$(Y)), _ '<-- use VARSEG with QB SADD(MainRec$(Y)) + Offset, _ SSEG(MainRec$(Lowest)), _ '<-- use VARSEG with QB SADD(MainRec$(Lowest)) + Offset, KeySize) = -1 THEN Lowest = Y END IF END IF NEXT PUT #NDXNumber, , RecordNums(Lowest) 'write the main index Remaining(Lowest) = Remaining(Lowest) - 2 IF Remaining(Lowest) THEN 'if the index is still active GET #FileNums(Lowest), , RecordNums(Lowest) RecOffset& = (RecordNums(Lowest) - 1) * CLNG(RecLength) + 1 GET #FileNum, RecOffset&, MainRec$(Lowest) END IF NEXT ELSE '----- Only one pass was needed so simply rename the index file NAME "$$PASS.1" AS NDXName$ END IF CLOSE 'close all open files IF Exist%("$$PASS.*") THEN 'ensure there's a file to kill KILL "$$PASS.*" 'kill the work files END IF ERASE FileNums, RecordNums 'erase the work arrays ERASE MainRec$, Remaining END SUB SUB LoadFile (FileNum, Segment, Address, Bytes&) STATIC IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536 Registers.AX = &H3F00 'read from file service Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle Registers.CX = Bytes& 'how many bytes to load Registers.DX = Address 'and at what address Registers.DS = Segment 'and at what segment CALL DOSInt(Registers) END SUB SUB SaveFile (FileNum, Segment, Address, Bytes&) STATIC IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536 Registers.AX = &H4000 'write to file service Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle Registers.CX = Bytes& 'how many bytes to load Registers.DX = Address 'and at what address Registers.DS = Segment 'and at what segment CALL DOSInt(Registers) END SUB SUB TypeISort (....) STATIC 'as shown in TYPISORT.BAS END SUB FILESORT.BAS begins by defining a function that returns a random number between 1 and 10. Although the earlier sort demonstrations simply read the test data from DATA statements, that is impractical when creating thousands of records. Instead, two arrays are filled--one with ten last names and another with ten first names--and these names are drawn from at random. The Registers TYPE variable that is defined is used by three of the supporting routines in this program. RegType is normally associated with CALL Interrupt and InterruptX, but I have written a small-code replacement to mimic InterruptX that works with DOS Interrupt &H21 only. DOSInt accepts just a single Registers argument, instead of the three parameters that BASIC's Interrupt and InterruptX require. Besides adding less code each time it is used, the routine itself is smaller and simpler than InterruptX. The remainder of the demonstration program should be easy to follow, so I won't belabor its operation; the real action is in the FileSort subprogram. Like TypeSort and TypeISort, FileSort is entirely pointer based, to accommodate TYPE elements of any size and structure. You provide the name of the main data file to be sorted, the name of an index file to create, and the length and offset of the keys within the disk records. The Displace parameter tells how far into the TYPE structure the key information is located. When calling TypeISort this value is should be one-based, but in the final merge pass where Compare3 is used, a zero-based number is required. Therefore, a copy is made (Offset = Displace - 1) near the beginning of the routine. This way, both are available quickly without having to calculate - 1 repeatedly slowing its operation. The initial steps FileSort performs are to determine how many records are in the data file, and from that how many records can fit into memory at one time. Once these are known, the number of passes necessary can be easily calculated. An extra step is needed to ensure that RecsPerPass is not greater than the number of records in the file. Just because 200 records can fit into memory at once doesn't mean there are that many records. In most cases where multiple passes are needed the last pass will process fewer records than the others. If there are, say, 700 records and each pass can sort 300, the last pass will sort only 100 records. Once the pass information is determined, a block of memory is created to hold each portion of the file for sorting. This is the purpose of the Buffer array. REDIM is used to create a 32K chunk of memory that doesn't impinge on available string space. For each pass that is needed, the number of records in the current pass is determined and the index array is initialized to increasing values. Then, a portion of the main data file is read using the LoadFile subprogram. BASIC does not allow you to read records from a random access file directly into a buffer specified by its address. And even if it did, we can load data much faster than pure BASIC by reading a number of records all at once. Once the current block of records has been loaded, TypeISort is called to sort the index array. The index array is also saved very quickly using SaveFile, which is the compliment to LoadFile. A unique name is given to each temporary index file such that the first one is named $$PASS.1, the second $$PASS.2, and so forth. By using dollar signs in the name it is unlikely that the routine will overwrite an existing file from another application. Of course, you may change the names to anything else if you prefer. Notice the extra step that manipulates the IndexAdjust variable. This adjustment is needed because each sort pass returns the index array holding record numbers starting at 0. The first time through, 1 must be added to each element to reflect BASIC's use of record numbers that start at 1. If the first pass sorts, say, 250 records, then the index values 1 through 250 are saved to disk. But the second pass is processing records 251 through 500, so an adjustment value of 251 must be added to each element prior to writing it to disk. If the data file is small and only one pass was needed, the $$PASS.1 file is simply renamed to whatever the caller specified. Otherwise, a merge pass is needed to determine which record number is the next in sequence based on the results of each pass. Believe it or not, this is the trickiest portion of the entire program. For the sake of discussion, we'll assume that four passes were required to sort the file. Each of the four index files contains a sequence of record numbers, and all of the records within that sequence are in sorted order. However, there is no relationship between the data records identified in one index file and those in another. Thus, each index file and corresponding data record must be read in turn. A FOR/NEXT loop then compares each of the four records, to see which is truly next in the final sequence. The complication arises as the merge nears completion, because some of the indexes will have become exhausted. This possibility is handled by the Remaining array. Elements in the Remaining array are initialized to the length of each index file as the indexes are opened. Then, as each index entry is read from disk, the corresponding element is decremented by two to show that another record number was read. Therefore, the current Remaining element must be checked to see if that index has been exhausted. Otherwise, data that was already processed might be considered in the merge comparisons. The final steps are to close all the open files, delete the temporary index files, and erase the work arrays to free the memory they occupied. One important point to observe is the use of SSEG to show Compare3 where the MainRec$ elements are located. SSEG is for BASIC 7 only; if you are using QuickBASIC you must change SSEG to VARSEG. SSEG can be used with either near or far strings in BASIC 7, but VARSEG works with near strings only. SSEG is used as the default, so an error will be reported if you are using QuickBASIC. The cursor will then be placed near the comment in the program that shows the appropriate correction. SEARCHING FUNDAMENTALS ====================== As with sorting, searching data effectively also requires that you select an appropriate algorithm. There are many ways to search data, and we will look at several methods here. The easiest to understand is a linear search, which simply examines each item in sequence until a match is found: FoundAt = 0 'assume no match FOR X = 1 TO NumElements 'search all elements IF Array$(X) = Sought$ THEN FoundAt = X 'remember where it is EXIT FOR 'no need to continue END IF NEXT IF FoundAt THEN 'if it was found PRINT "Found at element"; FoundAt ELSE PRINT "Not found" 'otherwise END IF For small arrays a linear search is effective and usually fast enough. Also, integer and long integer arrays can be searched reasonably quickly even if there are many elements. But with string data, as the number of elements that must be searched increases, the search time can quickly become unacceptable. This is particularly true when additional features are required such as searching without regard to capitalization or comparing only a portion of each element using MID$. Indeed, many of the same techniques that enhance a sort routine can also be employed when searching. To search ignoring capitalization you would first capitalize Sought$ outside of the loop, and then use UCASE$ with each element in the comparisons. Using UCASE$(Sought$) repeatedly within the loop is both wasteful and unnecessary: Sought$ = UCASE$(Sought$) . . IF UCASE$(Array$(X)) = Sought$ THEN Likewise, comparing only a portion of each string will require MID$ with each comparison, after using MID$ initially to extract what is needed from Sought$: Sought$ = MID$(Sought$, 12, 6) . . IF MID$(Array$(X), 12, 6) = Sought$ THEN And again, as with sorting, these changes may be combined in a variety of ways. You could even use INSTR to see if the string being searched for is within the array, when an exact match is not needed: IF INSTR(UCASE$(Array$(X)), Sought$) THEN However, each additional BASIC function you use will make the searching slower and slower. Although BASIC's INSTR is very fast, adding UCASE$ to each comparison as shown above slows the overall process considerably. There are three primary ways that searching can be speeded up. One is to apply simple improvements based on understanding how BASIC works, and knowing which commands are fastest. The other is to select a better algorithm. The third is to translate selected portions of the search routine into assembly language. I will use all three of these techniques here, starting with enhancements to the linear search, and culminating with a very fast binary search for use with sorted data. One of the slowest operations that BASIC performs is comparing strings. For each string, its descriptor address must be loaded and passed to the comparison routine. That routine must then obtain the actual data address, and examine each byte in both strings until one of the characters is different, or it determines that both strings are the same. As I mentioned earlier, if one or both of the strings are fixed-length, then copies also must be made before the comparison can be performed. There is another service that the string comparison routine must perform, which is probably not obvious to most programmers and which also impacts its speed. BASIC frequently creates and then deletes temporary strings without your knowing it. One example is the copy it makes of fixed-length strings before comparing them. But there are other, more subtle situations in which this can happen. For example, when you use IF X$ + Y$ > Z$ BASIC must create a temporary string comprised of X$ + Y$, and then pass that to the comparison routine. Therefore, that routine is also responsible for determining if the incoming string is a temporary copy, and deleting it if so. In fact, all of BASIC's internal routines that accept string arguments are required to do this. Therefore, one good way to speed searching of conventional (not fixed- length) string arrays is to first compare the lengths. Since strings whose lengths are different can't possibly be the same, this will quickly weed those out. BASIC's LEN function is much faster than its string compare routine, and it offers a simple but effective opportunity to speed things up. LEN is made even faster because it requires only a single argument, as opposed to the two required for the comparison routine. SLen = LEN(Sought$) 'do this once outside the loop FOR X = 1 TO NumElements IF LEN(Array$(X)) = SLen THEN 'maybe... IF Array$(X) = Sought$ THEN 'found it! FoundAt = X EXIT FOR END IF END IF NEXT Similarly, if the first characters are not the same then the strings can't match either. Like LEN, BASIC's ASC is much faster than the full string comparison routine, and it too can improve search time by eliminating elements that can't possibly match. Depending on the type and distribution of the data in the array, using both LEN and ASCII can result in a very fast linear search: SLen = LEN(Sought$) SAsc = ASC(Sought$) FOR X = 1 TO NumElements IF LEN(Array$(X)) = SLen THEN IF ASC(Array$(X)) = SAsc THEN IF Array$(X) = Sought$ THEN ... END IF END IF END IF NEXT Notice that the LEN test must always be before the ASC test, to avoid an "Illegal function call" error if the array element is a null string. If all or most of the strings are the same length, then LEN will not be helpful, and ASC should be used alone. As I mentioned before, when comparing fixed-length string arrays BASIC makes a copy of each element into a conventional string, prior to calling its comparison routine. This copying is also performed when using ASC is used, but not LEN. After all, the length of a fixed-length never changes, and BASIC is smart enough to know the length directly. But then, comparing the lengths of these string is pointless anyway. Because of the added overhead to make these copies, the performance of a conventional linear search for fixed-length data is generally quite poor. This is a shame, because fixed-length strings are often the only choice when as much data as possible must be kept in memory at once. And fixed- length strings lend themselves perfectly to names and addresses. It should be apparent by now that the best solution for quickly comparing fixed- length string arrays--and the string portion of TYPE arrays too--is with the various Compare functions already shown. If you are searching for an exact match, then either Compare or Compare2 will be ideal, depending on whether you want to ignore capitalization. If you have only a single string element in each array, you should define a dummy TYPE. This avoids the overhead of having to use both VARSEG and VARPTR as separate arguments. The short example program and SearchType functions that follow search a fixed-length string array for a match. DEFINT A-Z DECLARE FUNCTION Compare% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes) DECLARE FUNCTION Compare2% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes) DECLARE FUNCTION SearchType% (Array() AS ANY, Sought AS ANY) DECLARE FUNCTION SearchType2% (Array() AS ANY, Sought AS ANY) DECLARE FUNCTION SearchType3% (Array() AS ANY, Searched AS ANY) CLS TYPE FLen 'this lets us use SEG LastName AS STRING * 15 END TYPE REDIM Array(1 TO 4000) AS FLen '4000 is a lot of names DIM Search AS FLen 'best comparing like data FOR X = 1 TO 4000 STEP 2 'impart some realism Array(X).LastName = "Henderson" NEXT Array(4000).LastName = "Henson" 'almost at the end Search.LastName = "Henson" 'find the same name '----- first time how long it takes using Compare Start! = TIMER 'start timing FOR X = 1 TO 5 'search five times FoundAt = SearchType%(Array(), Search) NEXT IF FoundAt >= 0 THEN PRINT "Found at element"; FoundAt ELSE PRINT "Not found" END IF Done! = TIMER PRINT USING "##.## seconds with Compare"; Done! - Start! PRINT '----- then time how long it takes using Compare2 Start! = TIMER 'start timing FOR X = 1 TO 5 'as above FoundAt = SearchType2%(Array(), Search) NEXT IF FoundAt >= 0 THEN PRINT "Found at element"; FoundAt ELSE PRINT "Not found" END IF Done! = TIMER PRINT USING "##.## seconds with Compare2"; Done! - Start! PRINT '---- finally, time how long it takes using pure BASIC Start! = TIMER FOR X = 1 TO 5 FoundAt = SearchType3%(Array(), Search) NEXT IF FoundAt >= 0 THEN PRINT "Found at element"; FoundAt ELSE PRINT "Not found" END IF Done! = TIMER PRINT USING "##.## seconds using BASIC"; Done! - Start! END FUNCTION SearchType% (Array() AS FLen, Sought AS FLen) STATIC SearchType% = -1 'assume not found FOR X = LBOUND(Array) TO UBOUND(Array) IF Compare%(Array(X), Sought, LEN(Sought)) THEN SearchType% = X 'save where it was found EXIT FOR 'and skip what remains END IF NEXT END FUNCTION FUNCTION SearchType2% (Array() AS FLen, Sought AS FLen) STATIC SearchType2% = -1 'assume not found FOR X = LBOUND(Array) TO UBOUND(Array) IF Compare2%(Array(X), Sought, LEN(Sought)) THEN SearchType2% = X 'save where it was found EXIT FOR 'and skip what remains END IF NEXT END FUNCTION FUNCTION SearchType3% (Array() AS FLen, Searched AS FLen) STATIC SearchType3% = -1 'assume not found FOR X = LBOUND(Array) TO UBOUND(Array) IF Array(X).LastName = Searched.LastName THEN SearchType3% = X 'save where it was found EXIT FOR 'and skip what remains END IF NEXT END FUNCTION When you run this program it will be apparent that the SearchType function is the fastest, because it uses Compare which doesn't perform any case conversions. SearchType2 is only slightly slower with that added overhead, and the purely BASIC function, SearchType3, lags far behind at half the speed. Note that the array is searched five times in succession, to minimize the slight errors TIMER imposes. Longer timings are generally more accurate than short ones, because of the 1/18th second resolution of the PC's system timer. BINARY SEARCHES This is about as far as we can go using linear searching, and to achieve higher performance requires a better algorithm. The Binary Search is one of the fastest available; however, it requires the data to already be in sorted order. A Binary Search can also be used with a sorted index, and both methods will be described. Binary searches are very fast, and also very simple to understand. Unlike the Quick Sort algorithm which achieves great efficiency at the expense of being complicated, a Binary Search can be written using only a few lines of code. The strategy is to start the search at the middle of the array. If the value of that element value is less than that of the data being sought, a new halfway point is checked and the process repeated. This way, the routine can quickly zero in on the value being searched for. Figure 8-3 below shows how this works. 13: Zambia 12: Sweden 11: Peru 10: Mexico <-- step 2 9: Holland 8: Germany 7: Finland <-- step 1 6: England 5: Denmark 4: China 3: Canada 2: Austria 1: Australia Figure 8-3: How a Binary Search locates data in a sorted array. If you are searching for Mexico, the first element examined is number 7, which is halfway through the array. Comparing Mexico to Finland shows that Mexico is greater, so the distance is again cut in half. In this case, a match was found after only two tries--remarkably faster than a linear search that would have required ten comparisons. Even when huge arrays must be searched, data can often be found in a dozen or so tries. One interesting property of a binary search is that it takes no longer to find the last element in the array than the first one. The program below shows one way to implement a Binary Search. DEFINT A-Z DECLARE FUNCTION BinarySearch% (Array$(), Find$) CLS PRINT "Creating test data..." REDIM Array$(1 TO 1000) 'create a "sorted" array FOR X = 1 TO 1000 Array$(X) = "String " + RIGHT$("000" + LTRIM$(STR$(X)), 4) NEXT PRINT "Searching array..." FoundAt = BinarySearch%(Array$(), "String 0987") IF FoundAt >= 0 THEN PRINT "Found at element"; FoundAt ELSE PRINT "Not found" END IF END FUNCTION BinarySearch% (Array$(), Find$) STATIC BinarySearch% = -1 'no matching element yet Min = LBOUND(Array$) 'start at first element Max = UBOUND(Array$) 'consider through last DO Try = (Max + Min) \ 2 'start testing in middle IF Array$(Try) = Find$ THEN 'found it! BinarySearch% = Try 'return matching element EXIT DO 'all done END IF IF Array$(Try) > Find$ THEN 'too high, cut in half Max = Try - 1 ELSE Min = Try + 1 'too low, cut other way END IF LOOP WHILE Max >= Min END FUNCTION The BinarySearch function returns either the element number where a match was found, or -1 if the search string was not found. Not using a value of zero to indicate failure lets you use arrays that start with element number 0. As you can see, the simplicity of this algorithm belies its incredible efficiency. The only real problem is that the data must already be in sorted order. Also notice that two string comparisons must be made--one to see if the strings are equal, and another to see if the current element is too high. Although you could use Compare3 which examines the strings once and tells if the data is the same or which is greater, a Binary Search is so fast that this probably isn't worth the added trouble. As you will see when you run the test program, it takes far longer to create the data than to search it! Besides the usual enhancements that can be applied to the comparisons using UCASE$ or MID$, this function could also be structured to use a parallel index array. Assuming the data is not sorted but the index array is, the modified Binary Search would look like this: FUNCTION BinaryISearch% (Array$(), Index(), Find$) STATIC BinaryISearch% = -1 'assume not found Min = LBOUND(Array$) 'start at first element Max = UBOUND(Array$) 'consider through last DO Try = (Max + Min) \ 2 'start testing in middle IF Array$(Index(Try)) = Find$ THEN 'found it! BinaryISearch% = Try 'return matching element EXIT DO 'all done END IF IF Array$(Index(Try)) > Find$ THEN 'too high, cut Max = Try - 1 ELSE Min = Try + 1 'too low, cut other way END IF LOOP WHILE Max >= Min END FUNCTION NUMERIC ARRAYS All of the searching techniques considered so far have addressed string data. In most cases, string array searches are the ones that will benefit the most from improved techniques. As you have already seen, BASIC makes copies of fixed-length strings before comparing them, which slows down searching. And the very nature of strings implies that many bytes may have to be compared before determining if they are equal or which string is greater. In most cases, searching a numeric array is fast enough without requiring any added effort, especially when the data is integer or long integer. However, a few aspects of numeric searching are worth mentioning here. One is avoiding the inevitable rounding errors that are sure to creep into the numbers you are examining. Another is that in many cases, you may not be looking for an exact match. For example, you may need to find the first element that is higher than a given value, or perhaps determine the smallest value in an array. Unlike strings that are either the same or they aren't, the binary representation of numeric values is not always so precise. Consider the following test which *should* result in a match, but doesn't. Value! = 1! Result! = 2! CLS FOR X = 1 TO 1000 Value! = Value! + .001 NEXT IF Value! = Result! THEN PRINT "They are equal" ELSE PRINT "Value! ="; Value! PRINT "Result! ="; Result! END IF After adding .001 to Value! 1000 times Value! should be equal to 2, but instead it is slightly higher. This is because the binary storage method used by computers simply cannot represent every possible value with absolute accuracy. Even changing all of the single precision exclamation points (!) to double precision pound signs (#) will not solve the problem. Therefore, to find a given value in a numeric array can require some extra trickery. What is really needed is to determine if the numbers are *very close* to each other, as opposed to exactly the same. One way to accomplish this is to subtract the two, and see if the result is very close to zero. This is shown below. Value! = 1! Result! = 2! CLS FOR X = 1 TO 1000 Value! = Value! + .001 NEXT IF ABS(Value! - Result!) < .0001 THEN PRINT "They are equal" ELSE PRINT "Value! ="; Value! PRINT "Result! ="; Result! END IF Here, the absolute value of the difference between the numbers is examined, and if that difference is very small the numbers are assumed to be the same. Unfortunately, the added overhead of subtracting before comparing slows the comparison even further. There is no simple cure for this, and an array search must apply this subtraction to each element that is examined. Another common use for numeric array searches is when determining the largest or smallest value. Many programmers make the common mistake shown below when trying to find the largest value in an array. MaxValue# = 0 FOR X = 1 TO NumElements IF Array#(X) > MaxValue# THEN MaxValue# = Array#(X) Element = X END IF NEXT PRINT "The largest value found is"; MaxValue# PRINT "And it was found at element"; Element The problem with this routine is that it doesn't account for arrays where all of the elements are negative numbers! In that case no element will be greater than the initial MaxValue#, and the routine will incorrectly report zero as the result. The correct method is to obtain the lowest element value, and use that as a starting point: MaxValue# = Array#(1) FOR X = 2 TO NumElements IF Array#(X) > MaxValue# THEN MaxValue# = Array#(X) END IF NEXT PRINT "The largest value found is"; MaxValue# Determining the highest value in an array would be handled similarly, except the greater-than symbol (>) would be replaced with a less-than operator (<). SOUNDEX The final searching technique I will show is Soundex. It is often useful to search for data based on its sound, for example when you do not know how to spell a person's name. Soundex was invented in the 1920's and has been used since then by, among others, the U.S. Census Bureau. A Soundex code is an alpha-numeric representation of the sound of a word, and it is surprisingly accurate despite its simplicity. The classic implementation of Soundex returns a four-character result code. The first character is the same as the first letter of the word, and the other three are numeric digits coded as shown in Figure 8-4. 1 B, F, P, V 2 C, G, J, K, Q, S, X 3 D, T 4 L 5 M, N 6 R Figure 8-4: The Soundex code numbers returned for significant letters of the alphabet. Letters not shown are simply skipped as being statistically insignificant to the sound of the word. In particular, speaking accents often minimize the importance of vowels, and blur their distinction. If the string is short and there are fewer than four digits, the result is simply padded with trailing zeros. One additional rule is that a code digit is never repeated, unless there is an uncoded letter in between. In the listing that follows, two different implementations of Soundex are shown. 'SOUNDEX.BAS, Soundex routines and example DEFINT A-Z DECLARE FUNCTION ASoundex$ (Word$) DECLARE FUNCTION ISoundex% (Word$) CLS DO PRINT "press Enter alone to exit" INPUT "What is the first word"; FWord$ IF LEN(FWord$) = 0 THEN EXIT DO INPUT "What is the second word"; SWord$ PRINT 'Test by alpha-numeric soundex PRINT "Alpha-Numeric Soundex: "; FWord$; " and "; PRINT SWord$; " do "; IF ASoundex$(FWord$) <> ASoundex$(SWord$) THEN PRINT "NOT "; END IF PRINT "sound the same." PRINT 'Test by numeric soundex PRINT " Numeric Soundex: "; FWord$; " and "; PRINT SWord$; " do "; IF ISoundex%(FWord$) <> ISoundex%(SWord$) THEN PRINT "NOT "; END IF PRINT "sound the same." PRINT LOOP END FUNCTION ASoundex$ (InWord$) STATIC Word$ = UCASE$(InWord$) Work$ = LEFT$(Word$, 1) + "000" WkPos = 2 PrevCode = 0 FOR L = 2 TO LEN(Word$) Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1)) IF Temp THEN Temp = ASC(MID$("111122222222334556", Temp, 1)) IF Temp <> PrevCode THEN MID$(Work$, WkPos) = CHR$(Temp) PrevCode = Temp WkPos = WkPos + 1 IF WkPos > 4 THEN EXIT FOR END IF ELSE PrevCode = 0 END IF NEXT ASoundex$ = Work$ END FUNCTION FUNCTION ISoundex% (InWord$) STATIC Word$ = UCASE$(InWord$) Work$ = "0000" WkPos = 1 PrevCode = 0 FOR L = 1 TO LEN(Word$) Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1)) IF Temp THEN Temp = ASC(MID$("111122222222334556", Temp, 1)) IF Temp <> PrevCode THEN MID$(Work$, WkPos) = CHR$(Temp) PrevCode = Temp WkPos = WkPos + 1 IF WkPos > 4 THEN EXIT FOR END IF ELSE PrevCode = 0 END IF NEXT ISoundex% = VAL(Work$) END FUNCTION The first function, ASoundex, follows the standard Soundex definition and returns the result as a string. The ISoundex version cheats slightly by coding the first letter as a number, but it returns an integer value instead of a string. Because integer searches are many times faster than string searches, this version will be better when thousands--or even hundreds of thousands--of names must be examined. An additional benefit of the integer-only method is that it allows for variations on the first letter. For example, if you enter Cane and Kane in response to the prompts from SOUNDEX.BAS ASoundex will not recognize the names as sounding alike where ISoundex will. LINKED DATA =========== No discussion of searching and sorting would be complete without a mention of linked lists and other data links. Unlike arrays where all of the elements lie in adjacent memory locations, linked data is useful when data locations may be disjointed. One example is the linked list used by the DOS File Allocation Table (FAT) on every disk. As I described in Chapter 6, the data in each file may be scattered throughout the disk, and only through a linked list can DOS follow the thread from one sector in a file to another. Another example where linked data is useful--and the one we will focus on here--is to keep track of memo fields in a database. A memo field is a field that can store freeform text such as notes about a sales contact or a patient's medical history. Since these fields typically require varying lengths, it is inefficient to reserve space for the longest one possible in the main database file. Therefore, most programs store memo fields in a separate disk file, and use a *pointer field* in the main data file to show where the corresponding memo starts in the dedicated memo file. Similarly, a back pointer adjacent to each memo identifies the record that points to it. This is shown in Figure 8-5 below. ÚÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÂÄÒÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÂÄÄÒÄÄÄÄÄÄÄÄÄ MAIN.DAT ³LastName³FirstName³1ºLastName³FirstName³73ºLastNa... ÀÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÁÄÐÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÁÄÄÐÄÄÄÄÄÄÄÄÄ ^ ^ | | pointers into memo file -----+--------------------+ (forward pointers) offsets into --+----------+-------+--------------+ this memo file | | | | 1 73 126 233 ÚÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄ¿ MEMO.DAT ³1LMemo1 ³2LMemo2³3LMemo3 ³4LMemo4 ³ ÀÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÙ ^ ^ ^ ^ | | | | record numbers +----------+-------+--------------+ (back pointers) (L = length of this memo) Figure 8-5: Pointers relate record numbers to memo file offsets and vice versa. Here, the pointer in the main data file record is a long integer that holds the byte offset into the memo file where the corresponding memo text begins. And just before the memo text is an integer record number that shows which record this memo belongs to. (If you anticipate more than 65,535 records a long integer must be used instead.) Thus, these pointers provide links between the two files, and relate the information they contain. When a new record is added to the main file, the memo that goes with it is appended to the end of the memo file. BASIC's LOF function can be used to determine the current end of the memo file, which is then used as the beginning offset for the new memo text. And as the new memo is appended to MEMO.DAT, the first data actually written is the number of the new record in the main data file. The record number back pointer in the memo file is needed to allow memo data to be edited. Since there's no reasonable way to extend memo text when other memo data follows it, most programs simply abandon the old text, and allocate new space at the end of the file. The abandoned text is then marked as such, perhaps by storing a negative value as the record number. Storing a negative version of the abandoned data's length is ideal, because that both identifies the data as obsolete, and also tells how much farther into the file the next memo is located. The idea here is that you would periodically run a memo file maintenance program that compacts the file, thus eliminating the wasted space the abandoned memos occupy. This is similar to the DBPACK.BAS utility shown in Chapter 7, and also similar to the way that BASIC compacts string memory when it becomes full. But when an existing memo is relocated in the memo file, the field in the main data file that points to the memo must also be updated. And that's why the record number back pointer is needed: so the compaction program can know which record in the main file must be updated. The "L" identifier in the memo file in Figure 8-5, shown between the record number and memo text, is a length byte or word that tells how long the text is. If you plan to limit the memo field lengths to 255 or fewer characters, then a single byte is sufficient. Otherwise an integer must be used. An example of code that reads a data record and then its associated memo text is shown below. GET #MainFile, RecNumber, TypeVar MemoOffset& = TypeVar.MemoOff GET #MemoFile, MemoOffset& + 2, MemoLength% Memo$ = SPACE$(MemoLength%) GET #MemoFile, , Memo$ The first step reads a record from the main data file into a TYPE variable, and the second determines where in the memo file the memo text begins. Two is added to that offset in the second GET statement, to skip over the record number back pointer which isn't needed here. Once the length of the memo text is known, a string is assigned to that length, and the actual text is read into it. If you are using long integer record numbers you would of course use MemoOffset& + 4 in the second GET. And if you're using a single byte to hold the memo length you would define a fixed-length string to receive that byte: DIM Temp AS STRING *1 GET #MemoFile, MemoOffset& + 2, Temp MemoLength = ASC(Temp) Since BASIC doesn't offer a byte-sized integer data type, ASC and STR$ can be used to convert between numeric and string formats. ARRAY ELEMENT INSERTION AND DELETION ==================================== The last issue related to array and memory manipulation I want to cover is inserting and deleting elements. If you intend to maintain file indexes or other information in memory and in sorted order, you will need some way to insert a new entry. By the same token, deleting an entry in a database requires that the parallel index entry also be deleted. The most obvious way to insert or delete elements in an array is with a FOR/NEXT loop. The first example below inserts an element, and the second deletes one. '----- Insert an element: Element = 200 InsertValue = 999 FOR X = UBOUND(Array) TO Element + 1 STEP -1 Array(X) = Array(X - 1) NEXT Array(Element) = InsertValue '----- Delete an element: Element = 200 FOR X = Element TO UBOUND(Array) - 1 Array(X) = Array(X + 1) NEXT Array(UBOUND(Array)) = 0 'optionally clear last element For integer, long integer, and fixed-length arrays this is about as efficient as you can get, short of rewriting the code in assembly language. However, with floating point and string arrays the performance is less than ideal. Unless a numeric coprocessor is installed, floating point values are assigned using interrupts and support code in the emulator library. This adds an unnecessary level of complication that also impacts the speed. When strings are assigned the situation is even worse, because of the memory allocation overhead associated with dynamic string management. A better solution for floating point and string arrays is a series of SWAP statements. The short program below benchmarks the speed difference of the two methods, as it inserts an element into a single precision array. REDIM Array(1 TO 500) CLS Element% = 200 InsertValue = 999 Start = TIMER FOR A% = 1 TO 500 FOR X% = UBOUND(Array) TO Element% + 1 STEP -1 Array(X%) = Array(X% - 1) NEXT Array(Element%) = InsertValue NEXT Done = TIMER PRINT USING "##.## seconds when assigning"; Done - Start Start = TIMER FOR A% = 1 TO 500 FOR X% = UBOUND(Array) TO Element% + 1 STEP -1 SWAP Array(X%), Array(X% - 1) NEXT Array(Element%) = InsertValue NEXT Done = TIMER PRINT USING "##.## seconds when swapping"; Done - Start If you run this program in the BASIC environment, the differences may not appear that significant. But when the program is compiled to an executable file, the swapping method is more than four times faster. In fact, you should never compare programming methods using the BASIC editor for exactly this reason. In many cases, the slowness of the interpreting process overshadows significant differences between one approach and another. String arrays also benefit greatly from using SWAP instead of assignments, though the amount of benefit varies depending on the length of the strings. If you modify the previous program to use a string array, also add this loop to initialize the elements: FOR X% = 1 TO 500 Array$(X%) = "String number" + STR$(X) NEXT With BASIC PDS far strings the difference is only slightly less at about three to one, due to the added complexity of far data. Also, SWAP will always be worse than assignments when inserting or deleting elements in a fixed-length string or TYPE array. An assignment merely copies the data from one location to another. SWAP, however, must copy the data in both directions. Understand that when using SWAP with conventional string arrays, the data itself is not exchanged. Rather, the four-byte string descriptors are copied. But because BASIC program modules store string data in different segments, extra work is necessary to determine which descriptor goes with which segment. When near strings are being used, only six bytes are exchanged, regardless of the length of the strings. Four bytes hold the descriptors, and two more store the back pointers. SUMMARY ======= This chapter explained many of the finer points of sorting and searching all types of data in BASIC. It began with sorting concepts using the simple Bubble Sort as a model, and then went on to explain indexed and multi-key sorts. One way to implement a multi-key sort is by aligning the key fields into adjacent TYPE components. While there are some restrictions to this method, it is fairly simple to implement and also very fast. The Quick Sort algorithm was shown, and the SEEQSORT.BAS program on the accompanying disk helps you to understand this complex routine by displaying graphically the progress of the comparisons and exchanges as they are performed. Along the way you saw how a few simple modifications to any string sort routine can be used to sort regardless of capitalization, or based on only a portion of a string element. You also learned that writing a truly general sort routine that can handle any type of data requires dealing exclusively with segment and address pointers. Here, assembly language routines are invaluable for assisting you when performing the necessary comparisons and data exchanges. Although the actual operation of the assembly language routines will be deferred until Chapter 12, such routines may easily be added to a BASIC program using .LIB and .QLB libraries. I mentioned briefly the usefulness of packing and aligning data when possible, as an aid to fast sorting. In particular, dates can be packed to only three bytes in Year/Month/Day order, and other data such as zip codes can be stored in long integers. Because numbers can be compared much faster than strings, this helps the sorting routines operate more quickly. Array searching was also discussed in depth, and both linear and binary search algorithms were shown. As with the sorting routines, searching can also employ UCASE$ and MID$ to search regardless of capitalization, or on only a portion of each array element. Two versions of the Soundex algorithm were given, to let you easily locate names and other data based on how they sound. Besides showing the more traditional searching methods, I presented routines to determine the minimum and maximum values in a numeric array. I also discussed some of the ramifications involved when searching floating point data, to avoid the inevitable rounding errors that might cause a legitimate match to be ignored. Finally, some simple ways to insert and delete elements in both string and numeric arrays were shown. Although making direct assignments in a loop is the most obvious way to do this, BASIC's often-overlooked SWAP command can provide a significant improvement in speed. The next chapter will conclude this section about hands-on programming by showing a variety of program optimization techniques.